The data that we have is from Kaggle and it shows the car features for each model of car from 1990 to 2017, which includes most of the car brands in the market such as “Audi”, “BMW”, “Tesla”, “Toyota”, “Nissan” etc. Also includes most of the models for each brand. The year column indicates the year that the model was produced. Engine fuel type is categorical data indicating the type of fuel the engine is required, the levels include “diesel”, “electric”, “premium unleaded (required)”, “regular unleaded” and so on, a total of 11 types of fuels. Engine horsepower indicates how powerful the car engine is, which is an important element in our next analysis. The engine cylinder shows the number of cylinders the engine has. Transmission type, driven wheels and number of doors are pretty straightforward. The market category shows how the automobile market defines the specific model, which can include tags such as “High-Performance”, “Luxury”, “Hybrid” etc. can also mix and match any number of tags. Vehicle size is measured by the total volume of the car, it has three levels “Compact”, “Large” and “Midsize”. Vehicle style shows whether the car is a regular sedan, coupe, cargo van or wagon, it has a total of 16 levels. Highway MPG and city MPG are measurements to show how fuel efficiency that car is. The popularity index shows how well the car was sold. Finally, the MSRP indicates the manufacturer’s suggested retail price of the car. As we can see, the dataset has a total of 11914 rows and 16 columns, which is sufficient for our later analysis.
data <- read.csv("datafile.csv", header = T)
# Remove empty rows and duplicate rows
car <- data[complete.cases(data), ]
car <- car[!duplicated(car), ]
# Remove outliers
Q1 <- quantile(car$MSRP, .25)
Q3 <- quantile(car$MSRP, .75)
IQR <- IQR(car$MSRP)
car <- subset(car, car$MSRP > (Q1 - 1.5*IQR) & car$MSRP < (Q3 + 1.5*IQR))
# Remove unused columns and keep only the following columns
columns <- c("Make",
"Year",
"Engine.HP",
"Engine.Cylinders",
"highway.MPG",
"city.mpg",
"Popularity",
"MSRP")
car <- car[, columns]
attach(car)
head(car)## Make Year Engine.HP Engine.Cylinders highway.MPG city.mpg Popularity MSRP
## 1 BMW 2011 335 6 26 19 3916 46135
## 2 BMW 2011 300 6 28 19 3916 40650
## 3 BMW 2011 300 6 28 20 3916 36350
## 4 BMW 2011 230 6 28 18 3916 29450
## 5 BMW 2011 230 6 28 18 3916 34500
## 6 BMW 2012 230 6 28 18 3916 31200
The price of a car is the key that most people care about. So, the goal of this project is to analyze the relationships between each feature of the car and the manufacturer’s suggested retail price of the car. Which feature is highly correlated with the price and which one is not. With the correlation, we can even predict the price of a car with some key features. Since there is some feature that is not needed, such as the model which is an unknown factor of the car, each make can have a large variance between all the models. So, the features that we care about are “Make”, “Year”, “Engine horsepower”, “Engine cylinders”, “highway.MPG”, “city.mpg” and “Popularity”. We are going to find out which of these features can have the highest impact on the price of the car.
For the MSRP distribution we have here, it is clear that the highest appearance of the price range is between 2000 – 2999, 1137 cars are filling in this range. Besides that part, the rest of the data appears in a relatively normal shape, with a right tail, that is because there are some luxury and performance cars have a high retail price, and the price range for most of the daily used cars is between 20,600 and 39,000 with a mean of 29,500 and a median of 29,000. The minimum car retail price is 2,000 dollars, and the maximum is 75,200. This is because we removed the outliers, with the outlier the maximum price could exceed 2 million. This makes sense, as we expected most of the cars in the market today are between 20,000 to 40,000 US dollars and some luxury or performance cars have an extremely high price even above a million. And the minimum price we obtained is economy cars with production year before 2000.
library(plotly)
plot_ly(car, x = ~MSRP, type="histogram", name = 'MSRP')The central limit theorem in probability theory states that the distribution of a variable relative to its size increases as its sample size increases. This is done so assuming that all samples are the same size. CLT states that if large sample size is obtained, the mean of all the variables from that population will be roughly equal to that of the whole population, it also states that even though the distribution of the data is abnormal, the mean of the data is still within the mean of the population. In other words, if the sample size gets larger, the variance of the variables will also be equal to that of the population. In our case, we chose to use 4 different sample sizes, 10, 20, 30 and 40 which are randomly picked in the dataset. In our case, the population mean is 29,466.32 and the standard deviation is 15909.84. As we can see in the graph, the mean of each sample size is close to the population mean but as the sample size goes up, the standard deviation is getting smaller and smaller, and the distribution is getting even more narrow. This matches the central limit theorem, as the sample size gets larger, the shape of the sampling distribution becomes relatively normal.
means1 = rep(NA, 5000)
set.seed(544)
for(i in 1:5000){
means1[i] = mean(sample(car$MSRP, size = 10))
}
means2 = rep(NA, 5000)
set.seed(544)
for(i in 1:5000){
means2[i] = mean(sample(car$MSRP, size = 20))
}
means3 = rep(NA, 5000)
set.seed(544)
for(i in 1:5000){
means3[i] = mean(sample(car$MSRP, size = 30))
}
means4 = rep(NA, 5000)
set.seed(544)
for(i in 1:5000){
means4[i] = mean(sample(car$MSRP, size = 40))
}
cat(paste0("MSRP mean = ",round(mean(car$MSRP),2), ", SD = ", round(sd(car$MSRP),2),"\n"),
paste0("Sample size = 10, mean = ",round(mean(means1),2), ", SD = ", round(sd(means1),2),"\n"),
paste0("Sample size = 20, mean = ",round(mean(means2),2), ", SD = ", round(sd(means2),2),"\n"),
paste0("Sample size = 30, mean = ",round(mean(means3),2), ", SD = ", round(sd(means3),2),"\n"),
paste0("Sample size = 40, mean = ",round(mean(means4),2), ", SD = ", round(sd(means4),2)),sep = "")## MSRP mean = 29466.32, SD = 15909.84
## Sample size = 10, mean = 29470.11, SD = 5001.86
## Sample size = 20, mean = 29451.15, SD = 3545.45
## Sample size = 30, mean = 29416.12, SD = 2914.68
## Sample size = 40, mean = 29441.33, SD = 2483.19
subplot(plot_ly(x = means1, type="histogram", name = 'Sample size = 10'),
plot_ly(x = means2, type="histogram", name = 'Sample size = 20'),
plot_ly(x = means3, type="histogram", name = 'Sample size = 30'),
plot_ly(x = means4, type="histogram", name = 'Sample size = 40'),nrows = 2)The following table is a summary of MSRP based on different car makes. The highest average MSRP make is the well-known luxury brand, Maserati. The average retail price of Maserati is 71,000 dollars. We can see that it also has the highest minimum price, which means the variance of Maserati is small, this indicates that for most Maserati models, the price range is between 70,000 and 72,000 dollars. While Cadillac tells us a different story, as we can see in the table, although Cadillac has the fourth-highest average retail price, it has a relatively large variance as the minimum price for Cadillac is only 2000 dollars, and the highest Cadillac model has a retail price of 58,000 dollars. This can be caused by 2 reasons, first reason could be the dataset of Cadillac contains some extremely old models, produced before the year 2000. This makes sense as we mentioned earlier, the car price has a huge gap before and after the year 2000. Another reason is that there is a huge gap in retail prices between models in Cadillac. So, we did some inspection of the Cadillac to find out the year of the model whose retail price is below 1.5 IQR of the first quantile. The result shows our first guess is correct, that all the models that have a low retail price are produced before the year 2000, which means that all models made after the year 2000 have a relatively reasonable variance. When we look at the boxplot for the make vs. MSRP, we found that Mercedes, Porsche, SAAB and Volvo, these brands have a ridiculous interquartile range, which means the middle 50% of data points of these brands are widely spread, so that the price of each model have relatively big differences.
library(dplyr)
a <- aggregate(MSRP~Make, data = car, summary)
a %>% arrange(desc(MSRP[,"Mean"]))## Make MSRP.Min. MSRP.1st Qu. MSRP.Median MSRP.Mean MSRP.3rd Qu.
## 1 Maserati 69800.000 70400.000 71100.000 71000.000 71700.000
## 2 Lotus 43995.000 54990.000 65595.000 61717.750 68412.500
## 3 Alfa Romeo 53900.000 55900.000 63900.000 61600.000 65900.000
## 4 Cadillac 2000.000 44190.000 48777.500 49089.994 57595.000
## 5 BMW 4697.000 40425.000 46450.000 47150.061 56950.000
## 6 Genesis 41400.000 42650.000 43900.000 46616.667 49225.000
## 7 Land Rover 2561.000 37110.000 45570.000 45715.220 54950.000
## 8 Porsche 2667.000 7419.000 53900.000 45383.244 63900.000
## 9 Lincoln 2000.000 38301.250 42095.000 43118.847 47482.500
## 10 Infiniti 2000.000 35925.000 43300.000 42498.957 50400.000
## 11 Lexus 2000.000 36517.500 41900.000 41755.858 50417.500
## 12 Audi 2000.000 36850.000 43175.000 41593.393 53100.000
## 13 Mercedes-Benz 2000.000 4213.000 43325.000 37545.836 56037.500
## 14 HUMMER 30750.000 33390.000 36015.000 36464.412 39290.000
## 15 Acura 2000.000 23845.000 35905.000 33690.963 44800.000
## 16 GMC 2000.000 26328.750 32760.000 32444.085 40040.000
## 17 Volvo 2000.000 2179.000 37775.000 29724.684 43950.000
## 18 Buick 2000.000 26205.000 31027.500 29034.189 35845.000
## 19 Toyota 2000.000 21785.000 29985.000 28531.518 36115.000
## 20 Volkswagen 2000.000 24232.500 27700.000 28238.380 31191.250
## 21 Ford 2000.000 21941.250 29557.500 28222.931 36507.500
## 22 Chevrolet 2000.000 18934.250 26520.000 27991.537 36148.750
## 23 Saab 2000.000 2122.000 34975.000 27879.807 40400.000
## 24 Nissan 2000.000 20960.000 28570.000 27502.974 36840.000
## 25 Chrysler 2000.000 22055.000 29185.000 26722.963 34955.000
## 26 Honda 2000.000 22210.000 26150.000 26608.884 32550.000
## 27 Kia 2000.000 17741.250 23820.000 25318.750 31492.500
## 28 Hyundai 2000.000 19175.000 23600.000 24926.263 30100.000
## 29 Subaru 2000.000 21995.000 25645.000 24240.674 29995.000
## 30 Dodge 2000.000 17158.750 24477.500 22626.794 30495.000
## 31 FIAT 15990.000 19995.000 21880.000 22206.017 24697.500
## 32 Mitsubishi 2000.000 18395.000 23495.000 21316.351 27495.000
## 33 Mazda 2000.000 16050.000 22517.500 20106.556 26391.250
## 34 Scion 12480.000 16471.250 19205.000 19932.500 23190.000
## 35 Pontiac 2000.000 14930.000 22500.000 19800.044 27700.000
## 36 Suzuki 2000.000 15099.000 17642.000 18026.415 23524.000
## 37 Oldsmobile 2000.000 2000.000 2179.000 12843.795 23103.750
## 38 Plymouth 2000.000 2000.000 2000.000 3296.873 2000.000
## MSRP.Max.
## 1 72000.000
## 2 74995.000
## 3 68400.000
## 4 75010.000
## 5 74700.000
## 6 54550.000
## 7 71450.000
## 8 75200.000
## 9 74260.000
## 10 67050.000
## 11 72520.000
## 12 74100.000
## 13 75000.000
## 14 43130.000
## 15 65950.000
## 16 71665.000
## 17 65700.000
## 18 49625.000
## 19 65215.000
## 20 74600.000
## 21 68996.000
## 22 75195.000
## 23 51330.000
## 24 60490.000
## 25 49470.000
## 26 47070.000
## 27 61900.000
## 28 68750.000
## 29 39995.000
## 30 65945.000
## 31 29235.000
## 32 38995.000
## 33 44015.000
## 34 31090.000
## 35 37610.000
## 36 31749.000
## 37 36795.000
## 38 44625.000
cadi <- car[which(Make == "Cadillac"), ]
Q1.cadi <- quantile(cadi$MSRP, .25)
Q3.cadi <- quantile(cadi$MSRP, .75)
IQR.cadi <- IQR(cadi$MSRP)
cadi[which(cadi$MSRP < (Q1.cadi - 1.5*IQR.cadi)), "Year"]## [1] 1991 1992 1993 1990 1991 1992 1999 2000 2000 2000 2000 1994 1995 1996 1993
plot_ly(car, x = ~Make, y = ~MSRP, type="box", name = 'MSRP')Below is the summary table for different numbers of the engine cylinder. As we can see in the table, despite the 0- and 12-cylinder cars, the rest of the cars follow the rule the higher the number of engine cylinders is the higher the retail price. 0 cylinder indicates that the car is electric, which also makes sense since generally speaking, electric cars are more expensive than gas cars. But the weird thing here is the 12-cylinder cars, is has the lowest average retail price. Our guess is for those 12-cylinder cars, the production year is before 2000. And the second table proved our hypothesis, we can see that all 12-cylinder cars are produced before the year 2000.
aggregate(MSRP~Engine.Cylinders, data = car, summary)## Engine.Cylinders MSRP.Min. MSRP.1st Qu. MSRP.Median MSRP.Mean MSRP.3rd Qu.
## 1 0 25120.00 25560.00 39900.00 34511.92 42400.00
## 2 3 2000.00 2000.00 13495.00 9698.00 14907.50
## 3 4 2000.00 17845.00 23800.00 23981.28 30100.00
## 4 5 2000.00 2232.00 24950.00 21765.60 33390.00
## 5 6 2000.00 25105.00 33095.00 32491.45 41990.00
## 6 8 2000.00 30432.50 39350.00 38628.92 51835.00
## 7 10 68340.00 68340.00 68340.00 68340.00 68340.00
## 8 12 2960.00 3185.25 5240.50 5353.25 7121.50
## MSRP.Max.
## 1 43600.00
## 2 18175.00
## 3 74995.00
## 4 57200.00
## 5 75200.00
## 6 75195.00
## 7 68340.00
## 8 8488.00
car[which(car$Engine.Cylinders == 12), ]## Make Year Engine.HP Engine.Cylinders highway.MPG city.mpg
## 671 Mercedes-Benz 1992 402 12 14 10
## 672 Mercedes-Benz 1993 389 12 15 11
## 673 Mercedes-Benz 1993 389 12 16 12
## 674 Mercedes-Benz 1993 389 12 15 11
## 761 BMW 1995 322 12 19 12
## 762 BMW 1995 372 12 18 10
## 763 BMW 1996 322 12 19 12
## 765 BMW 1997 322 12 19 12
## Popularity MSRP
## 671 617 2960
## 672 617 3108
## 673 617 4623
## 674 617 3211
## 761 3916 5858
## 762 3916 8488
## 763 3916 6954
## 765 3916 7624
plot_ly(car, x = ~Engine.Cylinders, y = ~MSRP, type="box", name = 'MSRP')As we mentioned earlier, we noticed that the price before and after the year 2000 has a huge gap. Therefore, to prove this hypothesis, we made a visualization to show the trend of the price. And below is the line graph of the year versus MSRP. As we expected, there is a huge gap before and after the year 2000.
year_of_MSRP_mean<-aggregate(MSRP~Year, data = car, FUN = mean)
plot_ly(year_of_MSRP_mean,x = ~Year, y = ~MSRP, type = "scatter", mode = "lines+markers", name = "Mean of MSRP for different years")There is a total of 5 sampling methods. Like putting everyone’s name into a hat, random sampling is similar to drawing out several names. However, it is usually much harder to do since it requires a complete list of all the elements in the population. Unlike random sampling, systematic sampling is usually easier to do since it eliminates the list of all the elements. This method, which is similar to lining everyone up, eliminates the list of all the elements and uses all the numbers in the population. Although convenience sampling is generally easier to do, it is probably the worst technique for gathering data. In convenience sampling, the data is obtained by running into the first people that the survey encounters. The goal of cluster sampling is to divide the population into groups, which are usually geographically dispersed. Each element in the group is randomly selected. Another type of sampling that involves separating the population into groups is stratified sampling. For instance, if the population is composed of males and females, then a sample is taken from each stratum. The goal of sampling is to find a method that is likely to yield a representative sample of the population. As we can see in the below graph, the four sampling methods we used can clearly represent the population distribution, they are extremely similar in shape.
library(sampling)
set.seed(544)
sample1 = car[sample(1:nrow(car), size = 500), ]
N <- nrow(car)
n <- 500
k <- ceiling(N/n)
set.seed(544)
r = sample(k, 1)
sample2 <- car[seq(r, by = k, length = n),]
pik <- inclusionprobabilities(car$Popularity,500)
sample3 <- car[UPsystematic(pik) != 0,]
car1 <- car[order(car$Year),]
freq <- table(car1$Year)
st.sizes <- 500 * freq / sum(freq)
st.sizes <- as.vector(t(st.sizes))
st.sizes <- st.sizes[st.sizes!=0]
st.4 <- strata(car1, stratanames = "Year", size = st.sizes, method = "srswor")
sample4 <- getdata(car1, st.4)
subplot(plot_ly(x = car$MSRP, type="histogram", name = "Frequency of MSRP"),
plot_ly(x = sample1$MSRP, type="histogram", name = "Simple random sampling without replacement"),
plot_ly(x = sample2$MSRP, type="histogram", name = "Systematic sampling"),
plot_ly(x = sample3$MSRP, type="histogram", name = "Systematic sampling with unequal probabilities(Popularity)"),
plot_ly(x = sample4$MSRP, type="histogram", name = "Stratified sampling(Year)"),nrows = 3)As the linear model shows, the relationship between each feature and retail price is clear. As the year, engine horsepower and fuel efficiency go up, the price goes up. As engine cylinders and popularity go up the price goes down. The engine cylinders and popularity might cause some misunderstanding, engine cylinders have an inverse relationship to MSRP is because of the existence of the 12-cylinder engine as we mentioned earlier, the popularity has an inverse relationship is because the cheaper car is much more popular than those luxury cars. And there are 73% of the data can be presented by this model.
m <- lm(MSRP ~ Year + Engine.Cylinders + Engine.HP + Popularity + city.mpg)
summary(m)##
## Call:
## lm(formula = MSRP ~ Year + Engine.Cylinders + Engine.HP + Popularity +
## city.mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31714 -5364 -878 4817 42279
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.826e+06 2.998e+04 -60.893 < 2e-16 ***
## Year 9.099e+02 1.499e+01 60.703 < 2e-16 ***
## Engine.Cylinders -5.958e+02 1.033e+02 -5.766 8.35e-09 ***
## Engine.HP 1.234e+02 1.847e+00 66.801 < 2e-16 ***
## Popularity -6.004e-01 5.706e-02 -10.521 < 2e-16 ***
## city.mpg 7.877e+01 1.684e+01 4.678 2.93e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8298 on 10148 degrees of freedom
## Multiple R-squared: 0.7281, Adjusted R-squared: 0.728
## F-statistic: 5435 on 5 and 10148 DF, p-value: < 2.2e-16